It speaks, sings, and even plays tricks! Xiaomi launches the MiMo-V2-TTS large model, which handles dialects and emotions effortlessly
Xiaomi has launched MiMo-V2-TTS, its self-developed large text-to-speech model, marking a shift from mechanical recitation to emotional resonance. The model is built on an audio tokenizer and a joint multi-codebook architecture and was pre-trained on hundreds of millions of hours of speech data. It can act, speak, sing, and more, demonstrating versatile voice generation potential.